Challenge 5

challenge_5
cereal
public_schools
Introduction to Visualization
Author

Mekhala Kumar

Published

August 22, 2022

library(tidyverse)
library(ggplot2)
library(readr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
cereal <- read_csv("_data/cereal.csv")
str(cereal)
spec_tbl_df [20 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Cereal: chr [1:20] "Frosted Mini Wheats" "Raisin Bran" "All Bran" "Apple Jacks" ...
 $ Sodium: num [1:20] 0 340 70 140 200 180 210 150 100 130 ...
 $ Sugar : num [1:20] 11 18 5 14 12 1 10 16 0 12 ...
 $ Type  : chr [1:20] "A" "A" "A" "C" ...
 - attr(*, "spec")=
  .. cols(
  ..   Cereal = col_character(),
  ..   Sodium = col_double(),
  ..   Sugar = col_double(),
  ..   Type = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
print(summarytools::dfSummary(cereal,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

cereal

Dimensions: 20 x 4
Duplicates: 0
Variable            Stats / Values                   Freqs (% of Valid)    Missing
Cereal [character]  1. All Bran                      1 (5.0%)              0 (0.0%)
                    2. Apple Jacks                   1 (5.0%)
                    3. Captain Crunch                1 (5.0%)
                    4. Cheerios                      1 (5.0%)
                    5. Cinnamon Toast Crunch         1 (5.0%)
                    6. Corn Flakes                   1 (5.0%)
                    7. Crackling Oat Bran            1 (5.0%)
                    8. Fiber One                     1 (5.0%)
                    9. Froot Loops                   1 (5.0%)
                    10. Frosted Flakes               1 (5.0%)
                    [ 10 others ]                    10 (50.0%)
Sodium [numeric]    Mean (sd) : 167 (77.3)           15 distinct values    0 (0.0%)
                    min ≤ med ≤ max: 0 ≤ 180 ≤ 340
                    IQR (CV) : 65 (0.5)
Sugar [numeric]     Mean (sd) : 8.8 (5.3)            15 distinct values    0 (0.0%)
                    min ≤ med ≤ max: 0 ≤ 9.5 ≤ 18
                    IQR (CV) : 8.5 (0.6)
Type [character]    1. A                             10 (50.0%)            0 (0.0%)
                    2. C                             10 (50.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-22

The dataset records the type, sodium content and sugar content of different brands of cereal. It has 20 observations and 4 variables. The data is already tidy. The values in the Sodium and Sugar columns can be kept as they are, since this makes the data easier to visualise. Moreover, there are only 2 types, so the variable Type does not need to be turned into a factor.

The first barplot shows that this dataset contains an equal number of Type A and Type C cereals. Here, I am assuming that sodium is measured in milligrams and sugar in grams. The histograms show that most of the cereals have a sodium content between 150 and 250 mg and a sugar content between 5 and 15 g.

ggplot(cereal, aes(Type)) + geom_bar()

# after_stat(density) replaces the deprecated ..density.. notation
ggplot(cereal, aes(Sodium)) + geom_histogram(binwidth=100,aes(y = after_stat(density)))+
  geom_density(alpha = 0.2, fill="blue")

ggplot(cereal, aes(Sugar)) + geom_histogram(binwidth=10,aes(y = after_stat(density)))+
  geom_density(alpha = 0.2, fill="blue")
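
The ranges quoted above can also be checked numerically rather than read off the histograms. This is a minimal sketch that uses only the first ten Sodium and Sugar values visible in the `str()` output, so the shares it computes describe that subset and not the full 20 cereals. `share_between` is a hypothetical helper, not part of any package.

```r
# First ten values of each column, copied from the str() output above
sodium <- c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130)
sugar  <- c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12)

# Hypothetical helper: share of values inside [lo, hi], bounds included
share_between <- function(x, lo, hi) mean(x >= lo & x <= hi)

sodium_share <- share_between(sodium, 150, 250)  # sodium assumed to be in mg
sugar_share  <- share_between(sugar, 5, 15)      # sugar assumed to be in g
```

On the full data, the same helper could be called with `cereal$Sodium` and `cereal$Sugar`.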

Three graphs are shown below.
The first graph plots sodium against sugar content. A scatterplot was used since both variables are continuous. There is no clear pattern, such as cereals with more sugar having less sodium. Hence, there is no definite relationship between sugar and sodium content.
The second and third graphs show the relationship between the type of cereal and sodium, and between the type of cereal and sugar. Boxplots have been used here since ‘Type’ is a nominal variable.
For both sodium and sugar, the median content is higher for Type C cereals than for Type A cereals. Thus, in this sample, Type C cereals tend to have higher levels of sodium and sugar. However, the difference between the median sodium levels of Type A and Type C cereals is much smaller than the difference between their median sugar levels.
Another observation is that in Type A cereals, most of the sodium values lie in the first quartile, whereas most of the sugar values lie in the third quartile. In Type C cereals, for both sodium and sugar, most values lie in the first quartile.

ggplot(cereal, aes(Sodium,Sugar)) + geom_point()

ggplot(cereal, aes(Type,Sodium)) + geom_boxplot()

ggplot(cereal, aes(Type,Sugar)) + geom_boxplot()
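
To put a number on the impression given by the scatterplot, a correlation coefficient can be computed. The sketch below again uses only the first ten Sodium and Sugar values from the `str()` output, so its result is illustrative and may differ from the correlation in the full dataset.

```r
# First ten values of each column, copied from the str() output above
sodium <- c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130)
sugar  <- c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12)

# Pearson correlation (the default) measures linear association
r_pearson <- cor(sodium, sugar)

# Spearman correlation is rank-based and less sensitive to outliers
r_spearman <- cor(sodium, sugar, method = "spearman")
```

With so few observations, a coefficient of this kind should be read alongside the plot rather than instead of it.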

Public School Characteristics

PublicSchoolChar <- read_csv("_data/Public_School_Characteristics_2017-18.csv")
dim(PublicSchoolChar)
[1] 100729     79
print(summarytools::dfSummary(PublicSchoolChar,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

PublicSchoolChar

Dimensions: 100729 x 79
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
X [numeric]
Mean (sd) : -92.9 (16.9)
min ≤ med ≤ max:
-176.6 ≤ -89.3 ≤ 144.9
IQR (CV) : 20.2 (-0.2)
97136 distinct values 0 (0.0%)
Y [numeric]
Mean (sd) : 37.8 (5.8)
min ≤ med ≤ max:
-14.3 ≤ 38.8 ≤ 71.3
IQR (CV) : 7.7 (0.2)
97136 distinct values 0 (0.0%)
OBJECTID [numeric]
Mean (sd) : 50365 (29078.1)
min ≤ med ≤ max:
1 ≤ 50365 ≤ 100729
IQR (CV) : 50364 (0.6)
100729 distinct values 0 (0.0%)
NCESSCH [character]
1. 010000500870
2. 010000500871
3. 010000500879
4. 010000500889
5. 010000501616
6. 010000502150
7. 010000600193
8. 010000600872
9. 010000600876
10. 010000600877
[ 100719 others ]
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
100719(100.0%)
0 (0.0%)
NMCNTY [character]
1. Los Angeles County
2. Cook County
3. Maricopa County
4. Harris County
5. Orange County
6. Jefferson County
7. Montgomery County
8. Washington County
9. Wayne County
10. Dallas County
[ 1949 others ]
2264(2.2%)
1388(1.4%)
1256(1.2%)
1142(1.1%)
1074(1.1%)
980(1.0%)
888(0.9%)
848(0.8%)
817(0.8%)
814(0.8%)
89258(88.6%)
0 (0.0%)
SURVYEAR [character] 1. 2017-2018
100729(100.0%)
0 (0.0%)
STABR [character]
1. CA
2. TX
3. NY
4. FL
5. IL
6. MI
7. OH
8. PA
9. NC
10. NJ
[ 46 others ]
10323(10.2%)
9320(9.3%)
4808(4.8%)
4375(4.3%)
4245(4.2%)
3734(3.7%)
3610(3.6%)
2990(3.0%)
2691(2.7%)
2595(2.6%)
52038(51.7%)
0 (0.0%)
LEAID [character]
1. 7200030
2. 0622710
3. 1709930
4. 1200390
5. 3200060
6. 1200180
7. 1200870
8. 1500030
9. 4823640
10. 1201500
[ 17451 others ]
1121(1.1%)
1009(1.0%)
655(0.7%)
537(0.5%)
381(0.4%)
336(0.3%)
320(0.3%)
294(0.3%)
284(0.3%)
268(0.3%)
95524(94.8%)
0 (0.0%)
ST_LEAID [character]
1. PR-01
2. CA-1964733
3. IL-15-016-2990-25
4. FL-13
5. NV-02
6. FL-06
7. FL-29
8. HI-001
9. TX-101912
10. FL-50
[ 17451 others ]
1121(1.1%)
1009(1.0%)
655(0.7%)
537(0.5%)
381(0.4%)
336(0.3%)
320(0.3%)
294(0.3%)
284(0.3%)
268(0.3%)
95524(94.8%)
0 (0.0%)
LEA_NAME [character]
1. PUERTO RICO DEPARTMENT OF
2. Los Angeles Unified
3. City of Chicago SD 299
4. DADE
5. CLARK COUNTY SCHOOL DISTR
6. BROWARD
7. HILLSBOROUGH
8. Hawaii Department of Educ
9. HOUSTON ISD
10. PALM BEACH
[ 17147 others ]
1121(1.1%)
1009(1.0%)
655(0.7%)
537(0.5%)
381(0.4%)
336(0.3%)
320(0.3%)
294(0.3%)
284(0.3%)
268(0.3%)
95524(94.8%)
0 (0.0%)
SCH_NAME [character]
1. Lincoln Elementary School
2. Lincoln Elementary
3. Jefferson Elementary
4. Washington Elementary
5. Washington Elementary Sch
6. Central Elementary School
7. Jefferson Elementary Scho
8. Lincoln Elem School
9. Central High School
10. Roosevelt Elementary
[ 88366 others ]
64(0.1%)
61(0.1%)
53(0.1%)
49(0.0%)
46(0.0%)
42(0.0%)
33(0.0%)
33(0.0%)
32(0.0%)
32(0.0%)
100284(99.6%)
0 (0.0%)
LSTREET1 [character]
1. 6420 E. Broadway Blvd. Su
2. Box DOE
3. 2405 FAIRVIEW SCHOOL RD
4. 1820 XENIUM LN N
5. Main St
6. 335 ALTERNATIVE LN
7. 2101 N TWYMAN RD
8. 720 9TH AVE
9. 50 Moreland Rd.
10. 951 W Snowflake Blvd
[ 92384 others ]
33(0.0%)
28(0.0%)
22(0.0%)
19(0.0%)
13(0.0%)
12(0.0%)
11(0.0%)
11(0.0%)
10(0.0%)
10(0.0%)
100560(99.8%)
0 (0.0%)
LSTREET2 [character]
1. Suite B
2. Ste. 100
3. P.O. Box 1497
4. Suite A
5. Suite 200
6. Building B
7. Ste. 102
8. Ste. A
9. Suite 1
10. SUITE 111 HART
[ 482 others ]
8(1.4%)
7(1.2%)
6(1.0%)
6(1.0%)
5(0.8%)
4(0.7%)
4(0.7%)
4(0.7%)
4(0.7%)
4(0.7%)
540(91.2%)
100137 (99.4%)
LSTREET3 [logical]
All NA's
100729 (100.0%)
LCITY [character]
1. HOUSTON
2. Chicago
3. Los Angeles
4. BROOKLYN
5. SAN ANTONIO
6. Phoenix
7. BRONX
8. DALLAS
9. NEW YORK
10. Tucson
[ 14624 others ]
783(0.8%)
664(0.7%)
577(0.6%)
569(0.6%)
520(0.5%)
446(0.4%)
441(0.4%)
378(0.4%)
359(0.4%)
330(0.3%)
95662(95.0%)
0 (0.0%)
LSTATE [character]
1. CA
2. TX
3. NY
4. FL
5. IL
6. MI
7. OH
8. PA
9. NC
10. NJ
[ 45 others ]
10325(10.3%)
9320(9.3%)
4808(4.8%)
4377(4.3%)
4245(4.2%)
3736(3.7%)
3610(3.6%)
2990(3.0%)
2693(2.7%)
2595(2.6%)
52030(51.7%)
0 (0.0%)
LZIP [character]
1. 85710
2. 10456
3. 85364
4. 78521
5. 78572
6. 78577
7. 00731
8. 10457
9. 78539
10. 60623
[ 22526 others ]
53(0.1%)
45(0.0%)
44(0.0%)
43(0.0%)
42(0.0%)
41(0.0%)
39(0.0%)
37(0.0%)
37(0.0%)
36(0.0%)
100312(99.6%)
0 (0.0%)
LZIP4 [character]
1. 8888
2. 1199
3. 1299
4. 9801
5. 2099
6. 1399
7. 1699
8. 1599
9. 1499
10. 1899
[ 8615 others ]
899(1.5%)
113(0.2%)
111(0.2%)
106(0.2%)
104(0.2%)
101(0.2%)
100(0.2%)
99(0.2%)
94(0.2%)
89(0.2%)
57411(96.9%)
41502 (41.2%)
PHONE [character]
1. (505)880-3744
2. (520)225-6060
3. (505)721-1051
4. (480)461-4000
5. (972)316-3663
6. (505)527-5800
7. (520)745-4588
8. (480)497-3300
9. (623)445-5000
10. (480)484-6100
[ 91818 others ]
141(0.1%)
63(0.1%)
36(0.0%)
35(0.0%)
34(0.0%)
33(0.0%)
33(0.0%)
29(0.0%)
28(0.0%)
27(0.0%)
100270(99.5%)
0 (0.0%)
GSLO [character]
1. PK
2. KG
3. 09
4. 06
5. 07
6. 05
7. 03
8. 04
9. M
10. 01
[ 8 others ]
31179(31.0%)
23839(23.7%)
16627(16.5%)
12912(12.8%)
5441(5.4%)
2578(2.6%)
1581(1.6%)
1165(1.2%)
1113(1.1%)
964(1.0%)
3330(3.3%)
0 (0.0%)
GSHI [character]
1. 05
2. 12
3. 08
4. 06
5. 04
6. 02
7. 03
8. PK
9. M
10. N
[ 9 others ]
28039(27.8%)
26443(26.3%)
21860(21.7%)
10873(10.8%)
3938(3.9%)
1591(1.6%)
1446(1.4%)
1430(1.4%)
1113(1.1%)
796(0.8%)
3200(3.2%)
0 (0.0%)
VIRTUAL [character]
1. A virtual school
2. Missing
3. Not a virtual school
4. Not Applicable
656(0.7%)
183(0.2%)
99049(98.3%)
841(0.8%)
0 (0.0%)
TOTFRL [numeric]
Mean (sd) : 249.4 (275.2)
min ≤ med ≤ max:
-9 ≤ 178 ≤ 9626
IQR (CV) : 297 (1.1)
1906 distinct values 0 (0.0%)
FRELCH [numeric]
Mean (sd) : 221.6 (253.9)
min ≤ med ≤ max:
-9 ≤ 149 ≤ 7581
IQR (CV) : 272 (1.1)
1765 distinct values 0 (0.0%)
REDLCH [numeric]
Mean (sd) : 26 (36.9)
min ≤ med ≤ max:
-9 ≤ 16 ≤ 2045
IQR (CV) : 37 (1.4)
399 distinct values 0 (0.0%)
PK [numeric]
Mean (sd) : 34.8 (53.5)
min ≤ med ≤ max:
0 ≤ 22 ≤ 1912
IQR (CV) : 43 (1.5)
468 distinct values 64621 (64.2%)
KG [numeric]
Mean (sd) : 65 (46.9)
min ≤ med ≤ max:
0 ≤ 62 ≤ 948
IQR (CV) : 57 (0.7)
393 distinct values 43684 (43.4%)
G01 [numeric]
Mean (sd) : 64.4 (44.8)
min ≤ med ≤ max:
0 ≤ 62 ≤ 1408
IQR (CV) : 56 (0.7)
353 distinct values 43333 (43.0%)
G02 [numeric]
Mean (sd) : 64.6 (44.4)
min ≤ med ≤ max:
0 ≤ 63 ≤ 688
IQR (CV) : 56 (0.7)
345 distinct values 43268 (43.0%)
G03 [numeric]
Mean (sd) : 66.4 (46.3)
min ≤ med ≤ max:
0 ≤ 64 ≤ 783
IQR (CV) : 59 (0.7)
358 distinct values 43253 (42.9%)
G04 [numeric]
Mean (sd) : 67.9 (48.7)
min ≤ med ≤ max:
0 ≤ 65 ≤ 877
IQR (CV) : 61 (0.7)
382 distinct values 43470 (43.2%)
G05 [numeric]
Mean (sd) : 69.7 (56.7)
min ≤ med ≤ max:
0 ≤ 64 ≤ 985
IQR (CV) : 65 (0.8)
494 distinct values 44673 (44.3%)
G06 [numeric]
Mean (sd) : 91.5 (108.4)
min ≤ med ≤ max:
0 ≤ 56 ≤ 1155
IQR (CV) : 111 (1.2)
641 distinct values 58585 (58.2%)
G07 [numeric]
Mean (sd) : 102.7 (126.2)
min ≤ med ≤ max:
0 ≤ 52 ≤ 1439
IQR (CV) : 153 (1.2)
687 distinct values 63682 (63.2%)
G08 [numeric]
Mean (sd) : 101.9 (127.1)
min ≤ med ≤ max:
0 ≤ 50 ≤ 1608
IQR (CV) : 152 (1.2)
700 distinct values 63449 (63.0%)
G09 [numeric]
Mean (sd) : 124.7 (185.8)
min ≤ med ≤ max:
0 ≤ 40 ≤ 2799
IQR (CV) : 166 (1.5)
987 distinct values 68499 (68.0%)
G10 [numeric]
Mean (sd) : 120.4 (178.1)
min ≤ med ≤ max:
0 ≤ 39 ≤ 1837
IQR (CV) : 157 (1.5)
945 distinct values 68706 (68.2%)
G11 [numeric]
Mean (sd) : 115.4 (170.1)
min ≤ med ≤ max:
0 ≤ 40 ≤ 1719
IQR (CV) : 149 (1.5)
914 distinct values 68720 (68.2%)
G12 [numeric]
Mean (sd) : 114.1 (165.5)
min ≤ med ≤ max:
0 ≤ 43 ≤ 2580
IQR (CV) : 150 (1.5)
891 distinct values 68814 (68.3%)
G13 [logical]
1. FALSE
2. TRUE
36(97.3%)
1(2.7%)
100692 (100.0%)
TOTAL [numeric]
Mean (sd) : 515.7 (450.2)
min ≤ med ≤ max:
0 ≤ 434 ≤ 14286
IQR (CV) : 408 (0.9)
2945 distinct values 2229 (2.2%)
MEMBER [numeric]
Mean (sd) : 515.6 (449.9)
min ≤ med ≤ max:
0 ≤ 434 ≤ 14286
IQR (CV) : 408 (0.9)
2944 distinct values 2229 (2.2%)
AM [numeric]
Mean (sd) : 6.7 (30.3)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1395
IQR (CV) : 4 (4.5)
424 distinct values 20609 (20.5%)
HI [numeric]
Mean (sd) : 142.5 (240.6)
min ≤ med ≤ max:
0 ≤ 49 ≤ 4677
IQR (CV) : 160 (1.7)
1745 distinct values 3852 (3.8%)
BL [numeric]
Mean (sd) : 83 (151.4)
min ≤ med ≤ max:
0 ≤ 19 ≤ 5088
IQR (CV) : 90 (1.8)
1166 distinct values 8325 (8.3%)
WH [numeric]
Mean (sd) : 247.9 (275.1)
min ≤ med ≤ max:
0 ≤ 182 ≤ 8146
IQR (CV) : 312 (1.1)
1839 distinct values 3993 (4.0%)
HP [numeric]
Mean (sd) : 3.1 (24.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 1394
IQR (CV) : 2 (8)
305 distinct values 30008 (29.8%)
TR [numeric]
Mean (sd) : 20.7 (27.3)
min ≤ med ≤ max:
0 ≤ 12 ≤ 1228
IQR (CV) : 24 (1.3)
307 distinct values 7137 (7.1%)
FTE [numeric]
Mean (sd) : 32.6 (25.6)
min ≤ med ≤ max:
0 ≤ 27.6 ≤ 1419
IQR (CV) : 24 (0.8)
10066 distinct values 5233 (5.2%)
LATCOD [numeric]
Mean (sd) : 37.8 (5.8)
min ≤ med ≤ max:
-14.3 ≤ 38.8 ≤ 71.3
IQR (CV) : 7.7 (0.2)
96746 distinct values 0 (0.0%)
LONCOD [numeric]
Mean (sd) : -92.9 (16.9)
min ≤ med ≤ max:
-176.6 ≤ -89.3 ≤ 144.9
IQR (CV) : 20.2 (-0.2)
96911 distinct values 0 (0.0%)
ULOCALE [character]
1. 21-Suburb: Large
2. 11-City: Large
3. 41-Rural: Fringe
4. 42-Rural: Distant
5. 13-City: Small
6. 43-Rural: Remote
7. 32-Town: Distant
8. 12-City: Mid-size
9. 33-Town: Remote
10. 22-Suburb: Mid-size
[ 2 others ]
26772(26.6%)
14851(14.7%)
11179(11.1%)
10279(10.2%)
6635(6.6%)
6412(6.4%)
6266(6.2%)
5876(5.8%)
4138(4.1%)
3305(3.3%)
5016(5.0%)
0 (0.0%)
STUTERATIO [numeric]
Mean (sd) : 16.9 (85.7)
min ≤ med ≤ max:
0 ≤ 15.3 ≤ 22350
IQR (CV) : 5.3 (5.1)
3854 distinct values 6835 (6.8%)
STITLEI [character]
1. Missing
2. No
3. Not Applicable
4. Yes
864(0.9%)
14596(14.5%)
29199(29.0%)
56070(55.7%)
0 (0.0%)
AMALM [numeric]
Mean (sd) : 3.7 (16.1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 743
IQR (CV) : 2 (4.4)
268 distinct values 26365 (26.2%)
AMALF [numeric]
Mean (sd) : 3.6 (15.5)
min ≤ med ≤ max:
0 ≤ 1 ≤ 652
IQR (CV) : 2 (4.4)
263 distinct values 26708 (26.5%)
ASALM [numeric]
Mean (sd) : 15.9 (45.2)
min ≤ med ≤ max:
0 ≤ 3 ≤ 1997
IQR (CV) : 11 (2.8)
522 distinct values 16162 (16.0%)
ASALF [numeric]
Mean (sd) : 15.1 (42.5)
min ≤ med ≤ max:
0 ≤ 3 ≤ 1532
IQR (CV) : 11 (2.8)
495 distinct values 16080 (16.0%)
HIALM [numeric]
Mean (sd) : 73.7 (123.5)
min ≤ med ≤ max:
0 ≤ 25 ≤ 2292
IQR (CV) : 83 (1.7)
1073 distinct values 4774 (4.7%)
HIALF [numeric]
Mean (sd) : 70.5 (118.7)
min ≤ med ≤ max:
0 ≤ 24 ≤ 2461
IQR (CV) : 79 (1.7)
1047 distinct values 5121 (5.1%)
BLALM [numeric]
Mean (sd) : 43.5 (77.3)
min ≤ med ≤ max:
0 ≤ 11 ≤ 2473
IQR (CV) : 48 (1.8)
687 distinct values 10801 (10.7%)
BLALF [numeric]
Mean (sd) : 42.1 (76.8)
min ≤ med ≤ max:
0 ≤ 10 ≤ 2615
IQR (CV) : 46 (1.8)
693 distinct values 11485 (11.4%)
WHALM [numeric]
Mean (sd) : 128.6 (140.5)
min ≤ med ≤ max:
0 ≤ 95 ≤ 3854
IQR (CV) : 160 (1.1)
1046 distinct values 4502 (4.5%)
WHALF [numeric]
Mean (sd) : 120.8 (135.6)
min ≤ med ≤ max:
0 ≤ 88 ≤ 4292
IQR (CV) : 152 (1.1)
1030 distinct values 4682 (4.6%)
HPALM [numeric]
Mean (sd) : 1.7 (13.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 751
IQR (CV) : 1 (7.9)
210 distinct values 34182 (33.9%)
HPALF [numeric]
Mean (sd) : 1.6 (12.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 643
IQR (CV) : 1 (7.7)
212 distinct values 34563 (34.3%)
TRALM [numeric]
Mean (sd) : 10.8 (13.9)
min ≤ med ≤ max:
0 ≤ 6 ≤ 512
IQR (CV) : 13 (1.3)
174 distinct values 9200 (9.1%)
TRALF [numeric]
Mean (sd) : 10.5 (14)
min ≤ med ≤ max:
0 ≤ 6 ≤ 716
IQR (CV) : 12 (1.3)
183 distinct values 9477 (9.4%)
TOTMENROL [numeric]
Mean (sd) : 264.9 (229)
min ≤ med ≤ max:
0 ≤ 224 ≤ 6890
IQR (CV) : 210 (0.9)
1691 distinct values 2296 (2.3%)
TOTFENROL [numeric]
Mean (sd) : 251.1 (222.8)
min ≤ med ≤ max:
0 ≤ 211 ≤ 7396
IQR (CV) : 200 (0.9)
1646 distinct values 2362 (2.3%)
STATUS [numeric]
Mean (sd) : 1.1 (0.6)
min ≤ med ≤ max:
1 ≤ 1 ≤ 8
IQR (CV) : 0 (0.5)
1:98557(97.8%)
3:1103(1.1%)
4:77(0.1%)
5:110(0.1%)
6:500(0.5%)
7:341(0.3%)
8:41(0.0%)
0 (0.0%)
UG [numeric]
Mean (sd) : 11.2 (33.6)
min ≤ med ≤ max:
0 ≤ 2 ≤ 1017
IQR (CV) : 10 (3)
217 distinct values 88689 (88.0%)
AE [logical]
1. FALSE
2. TRUE
60(93.8%)
4(6.2%)
100665 (99.9%)
SCHOOL_TYPE_TEXT [character]
1. Alternative/other school
2. Regular school
3. Special education school
4. Vocational school
5531(5.5%)
91737(91.1%)
1948(1.9%)
1513(1.5%)
0 (0.0%)
SY_STATUS_TEXT [character]
1. Currently operational
2. New school
3. School has changed agency
4. School has reopened
5. School temporarily closed
6. School to be operational
7. School was operational bu
98557(97.8%)
1103(1.1%)
110(0.1%)
41(0.0%)
500(0.5%)
341(0.3%)
77(0.1%)
0 (0.0%)
SCHOOL_LEVEL [character]
1. Adult Education
2. Elementary
3. High
4. Middle
5. Not Applicable
6. Not Reported
7. Other
8. Prekindergarten
9. Secondary
10. Ungraded
28(0.0%)
53287(52.9%)
22977(22.8%)
16506(16.4%)
796(0.8%)
1113(1.1%)
3824(3.8%)
1430(1.4%)
602(0.6%)
166(0.2%)
0 (0.0%)
AS [numeric]
Mean (sd) : 29.8 (85.8)
min ≤ med ≤ max:
0 ≤ 5 ≤ 3529
IQR (CV) : 21 (2.9)
850 distinct values 12717 (12.6%)
CHARTER_TEXT [character]
1. No
2. Not Applicable
3. Yes
87007(86.4%)
6387(6.3%)
7335(7.3%)
0 (0.0%)
MAGNET_TEXT [character]
1. Missing
2. No
3. Not Applicable
4. Yes
6256(6.2%)
77531(77.0%)
13520(13.4%)
3422(3.4%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-22

Briefly describe the data

The dataset contains details about public schools. It has 100729 observations and 79 variables. Many of the variables could be renamed to make them easier to understand. Some of the variables need to be turned into factors. In this challenge, I focus on the state and school level variables, so I make these changes only to those two variables. I also made a smaller dataframe with two states in order to compare observations between them.
One issue faced while reordering the categories of the school level variable was that secondary, middle and high school are all listed. I assumed that secondary should include both middle and high school, but the numbers of observations for middle and high school do not add up to the number of observations for secondary school. Hence, I have kept all three in the dataset.
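
The mismatch can be checked directly from the SCHOOL_LEVEL frequencies in the data frame summary above. A small sketch follows; the counts are copied from that summary, and the variable names are only illustrative.

```r
# SCHOOL_LEVEL counts copied from the dfSummary output
school_level_counts <- c(Middle = 16506, High = 22977, Secondary = 602)

# If "Secondary" were simply Middle + High, these two numbers would agree
middle_plus_high  <- school_level_counts[["Middle"]] + school_level_counts[["High"]]
matches_secondary <- middle_plus_high == school_level_counts[["Secondary"]]
```

Since the sum is far larger than the Secondary count, Secondary appears to be a separate category rather than a total of the other two.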

Tidy Data (as needed)

PublicSchoolChar<-PublicSchoolChar%>%
  rename( State= STABR )
PublicSchoolChar<-PublicSchoolChar%>%select(State,SCHOOL_LEVEL,everything())
level <- unique(PublicSchoolChar$SCHOOL_LEVEL)
level
 [1] "Elementary"      "High"            "Other"           "Not Reported"   
 [5] "Middle"          "Secondary"       "Prekindergarten" "Not Applicable" 
 [9] "Ungraded"        "Adult Education"
PublicSchoolChar<-PublicSchoolChar%>%
  # levels= reorders the categories; labels= would rename them and mix up the counts
  mutate(Levels = factor(SCHOOL_LEVEL, 
                       levels=level[c(4,8,9,7,1,6,5,2,10,3)]))%>%
  select(-SCHOOL_LEVEL)
rm(level)

table(PublicSchoolChar$Levels)

   Not Reported  Not Applicable        Ungraded Prekindergarten      Elementary 
           1113             796             166            1430           53287 
      Secondary          Middle            High Adult Education           Other 
            602           16506           22977              28            3824 
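
A note on `factor()`: the `levels` argument declares which values exist and in what order, while `labels` renames the existing levels, which by default are the sorted unique values. Passing a reordered vector to `labels` therefore attaches the new names to the wrong observations. A minimal sketch with made-up values:

```r
# Made-up school levels, for illustration only
x      <- c("Middle", "Elementary", "High", "Elementary")
wanted <- c("Elementary", "Middle", "High")

# levels= keeps every value as it is and only sets the order
by_levels <- factor(x, levels = wanted)

# labels= renames the alphabetically sorted levels
# (Elementary, High, Middle) to the entries of `wanted` in turn,
# so "High" becomes "Middle" and "Middle" becomes "High"
by_labels <- factor(x, labels = wanted)

table(by_levels)
table(by_labels)
```

Comparing the counts from `table()` against a summary of the original column is a quick way to confirm that a reordering did not change the data.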
State2<-PublicSchoolChar%>%filter(State == "MA"|State=="NJ")
State2

Univariate Visualisations

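Here the number of observations in each state can be seen. The distribution of the different school levels is also visible. According to the SCHOOL_LEVEL counts in the data frame summary, elementary schools make up the majority of the observations (52.9%), followed by high schools (22.8%) and middle schools (16.4%). Only about 2% of the schools are recorded as Not Reported or Not Applicable, so little of this variable is effectively missing.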

ggplot(PublicSchoolChar, aes(State)) + geom_bar()

ggplot(PublicSchoolChar, aes(Levels)) + geom_bar()
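
With more than fifty states on one axis, the bar chart can be hard to read. One option, sketched here on made-up data, is to order the bars by frequency with `forcats::fct_infreq()` and flip the axes so that the labels do not overlap. The `toy` data frame is hypothetical and stands in for `PublicSchoolChar`.

```r
library(ggplot2)
library(forcats)

# Made-up rows standing in for PublicSchoolChar
toy <- data.frame(State = c("CA", "CA", "CA", "TX", "TX", "NY"))

# Order the states from most to least frequent
toy$State <- fct_infreq(factor(toy$State))

# Horizontal bars leave room for one readable label per state
p <- ggplot(toy, aes(State)) + geom_bar() + coord_flip()
```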

Bivariate Visualisations (Doubt)

In order to make a bivariate visualisation, a continuous variable is also required. However, I could not work out from the column names what the columns in the dataset represent, and hence was unable to complete this step.
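
For reference, one way such a plot could look is a boxplot of a numeric column split by school level. This is only a sketch under an assumption that the source does not confirm: that TOTAL holds the total number of enrolled students per school. The rows below are made up, and `toy_schools` is a hypothetical stand-in for `PublicSchoolChar`.

```r
library(ggplot2)

# Made-up rows; the real values would come from PublicSchoolChar
toy_schools <- data.frame(
  Levels = factor(
    c("Elementary", "Elementary", "Middle", "Middle", "High", "High"),
    levels = c("Elementary", "Middle", "High")
  ),
  # Assumption: TOTAL is total enrolment per school (not confirmed)
  TOTAL = c(350, 420, 610, 580, 1200, 950)
)

# One box per school level, showing the spread of the numeric column
p <- ggplot(toy_schools, aes(Levels, TOTAL)) + geom_boxplot()
```

The meaning of TOTAL and the other columns would need to be confirmed against the dataset's documentation before interpreting a plot like this.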
